Data Origins

The raw data for this project was downloaded from the GOV.UK website.

More information about the variables can be found in the codebook.txt file in the data folder.

Data is available from this link:

https://www.gov.uk/government/publications/deaths-associated-with-neurological-conditions

The URL for these pages is:

https://evie-lois.github.io/deaths-assoc-neuro/

The repository for these pages is:

https://github.com/evie-lois/deaths-assoc-neuro.git

Research Question

Which neurological condition(s) has the highest rate of associated deaths from 2001 to 2014?

The visualisation will attempt to show the number of deaths associated with neurological conditions between the years 2001 to 2014, in order to visually represent the neurological condition with the highest rate of associated deaths.

Data Preparation

In order to have my data ready for visualisation I needed to clean and reshape the data. As the dataset I had contained a table with two tables embedded I had to make these into one complete dataset. In the process, I removed outliers and repeated columns in order to prepare the data. The columns were also renamed into codes so they would be easier to work with.

#Split embedded table into two

first_table <- data[1:16, ]

second_table <-data[17:nrow(data), ]

# Structure tables, exclude outliers 

first_table <- first_table[-c(16), ]

second_table <- second_table[-c(16:31),]

# Swap the first row of each table to become the column names

colnames(first_table) <- first_table[1, ]

first_table <- first_table[-1, ]

colnames(second_table) <-second_table[1, ]

second_table <- second_table[-1, ]

# Clean data, removing outliers

data1_cleaned <- first_table[, -c(3,5,7,9,11,13,15,17,19)]

data1_cleaned <- data1_cleaned[, -c(2,11)]

data2_cleaned <- second_table[, -c(3,5,7,9,11,13,15,17,19)]

data2_cleaned <- data2_cleaned[, -c(8,10)]

# Combine tables

combined_table <- bind_cols(data1_cleaned, data2_cleaned)
## New names:
## • `Year of the registration of death` -> `Year of the registration of
##   death...1`
## • `Year of the registration of death` -> `Year of the registration of
##   death...10`
# Clean new dataset

cleaned_dataset <- combined_table[,-c(10)]

#Code columns

cols <- c("year", "dwnc", "epi", "mndsma", "msid", "nmd", "poed", "tbsi", "totns", "at", "cnsi", "cnd", "dd", "fd", "ham", "raond", "smar")

colnames(cleaned_dataset) <- cols

#Reshape the data

reshaped_dataset <- cleaned_dataset %>%
  pivot_longer(cols = -year, 
               names_to = "Condition",
               values_to = "Deaths")

#Check for missing values in the dataset

sum(is.na(reshaped_dataset))
## [1] 0
#Check structure and data types

str(reshaped_dataset)
## tibble [224 × 3] (S3: tbl_df/tbl/data.frame)
##  $ year     : chr [1:224] "2001" "2001" "2001" "2001" ...
##  $ Condition: chr [1:224] "dwnc" "epi" "mndsma" "msid" ...
##  $ Deaths   : chr [1:224] "23051" "1936" "1574" "1301" ...
#Check number of rows and columns

dim(reshaped_dataset)
## [1] 224   3
#Show cleaned and reshaped dataset

head(reshaped_dataset)
## # A tibble: 6 × 3
##   year  Condition Deaths
##   <chr> <chr>     <chr> 
## 1 2001  dwnc      23051 
## 2 2001  epi       1936  
## 3 2001  mndsma    1574  
## 4 2001  msid      1301  
## 5 2001  nmd       563   
## 6 2001  poed      6963

Once the data is prepared, visualisation can begin.

Visualisation

I created my visualisation using ggplot and plotly. For my visualisation I wanted to represent the data so it would be clear as to which neurological condition had the highest rate of associated deaths. I chose a horizontal view as the data presented better this way. I included the use of ggplotly to allow for interaction of the data to view in more detail.

# Checking 'Deaths' column is numeric in order to plot data

str(reshaped_dataset$Deaths) 
##  chr [1:224] "23051" "1936" "1574" "1301" "563" "6963" "1773" "4995" "80" ...
reshaped_dataset$Deaths <- as.numeric(as.character(reshaped_dataset$Deaths))

# Creating the scatter plot

p <- ggplot(reshaped_dataset, aes(x = year, y = Deaths, color = Condition)) +
  geom_point(size = 3) + #adjusting plot point size
  labs(title = "Deaths associated with Neurological Conditions", subtitle = "Between the Years 2001 to 2014",
       x = "Year", 
       y = "Number of Deaths", caption = "Source: GOV.UK") + #labeling the x and y axis and citing data source
  theme_minimal() + #minimal for a clean look
  theme(plot.title = element_text(size = 16, face = "bold"), 
        #format title size and boldness
        plot.subtitle = element_text(hjust = 0.5, size = 12, face = "italic"), 
        #format subtitle size, potition and italics
        legend.title = element_text(size = 11, face = "bold"), 
        #format legend title size and boldness
        axis.title = element_text(size = 14), 
        #format axis title size
        axis.text = element_text(size = 12), 
        #format axis text size
        plot.background = element_rect(fill = "snow")) + 
  #fill background colour
  
#Colour code and rename labels  
         
  scale_y_log10(labels = scales::comma) +
  
#Colour coding conditions, making sure they will be visable on the graph and not clash with other colours
  
  scale_color_manual(values = c( "dwnc" = "red", "epi" = "green", "mndsma" = "lightcoral", "msid" = "pink", "nmd"= "purple", "poed" = "orange", "tbsi" = "magenta", "totns" = "cyan", "at" = "blue", "cnsi" = "dodgerblue", "cnd" = "darkgreen", "dd" = "lightblue", "fd" = "violet", "ham" = "gold", "raond"= "darkblue", "smar" = "darkgray"),
                     
 #Renaming neurological conditions
 
labels = c( "dwnc" = "Deaths with Mention of Neurological Condition", "epi" = "Epliepsy", "mndsma" = "Motor Neurone Disease and Spinal Muscular Atrophy", "msid" = "Multiple Sclerosis and Inflammatory Disorders", "nmd" = "Neuromuscular Diseases", "poed" = "Parkinsonism and Other Etrapyrimidal Disorders", "tbsi" = "Traumatic brain and Spinal Injury", "totns" = "Tumours of the Nervous System", "at"= "Ataxia", "cnsi" = "CNS Infections", "cnd" = "Cranial Nerve Damage", "dd" = "Development Disorders", "fd" = "Functional Disorders", "ham" = "Headaches and Migraines", "raond" = "Rare and Other Neurological Diseases", "smar" = "Spondylotic Myelophathy and Radiculopathy")) +
  
# Ammend plot 
  coord_flip() + #flip plot to horizontal
  theme(axis.line = element_line(color = "black")) #adding black axis lines

print(p) #print the ggplot

ggplotly(p) #interactive plot

The final output is saved in the figs folder

#Saving the plot
ggsave(here("figs","neurodeathsplot.png"))
## Saving 7 x 5 in image

Summary

Interpretation

The visualisation shows the death rates associated with neurological conditions. The neurological conditions with the highest rate from 2001-2014 is Parkinsonism and Other Etrapyrimidal Disorders.This is an interesting analysis as disorders within this category such as Parkinson’s disease affect a large number of individuals within the UK and achknowledging the number of deaths associated with the disorder could help to show the public the importance of donating to charities such as Parkinson’s UK to make a difference for people living with Parkinson’s disease. Furthmore, this visualisation creates a comparison for the different neurological conditions and how they were associated with a number of deaths. A limitation of this is that the data is from the years 2001-2014 so an updated dataset would be useful to compare the numbers of deaths within the last ten years.

Follow-ups

To investigate further, I would include the age of participants and run a regression analysis to test if there is a correlation between neurological disorder deaths and old age. I would also follow up this analysis with an up to date dataset to compare the results to see if there are any changes to the neuroloigcal conditions with the highest rate of associated deaths.

Conclusion

From completing this module and code project, I have learned how to use R efficiently after having no previous knowledge of the software. I have also been able to understand my mistakes and learn how to correct them within R. If I was to do this project again with more time and data accessibility, I would have followed through with my previous choice of public health data looking into a comparison of the Western and Mediterranean diet. By comparing the effects on life expectancy and disease. I would have chosen data from individuals that follow either diet and compare their health to see if there is a correlation of disease and shorter life expectancy when following the Western diet in order to understand the effects of what we consume on our health and present this data to show others that the cheap and convenient lifestyle of the Western diet can have a large impact on our lives.